PCT/IB95/00667

Processing system with word-aligned branch target. 



The present invention is directed to a processing system comprising a processor.
Instruction sets for different processors define the permissible addresses of each
data item or first byte of any instruction. Known processors can be divided into two
categories: 1) Fully-aligned, where the address can be only on specific boundaries, normally
byte, word (2 bytes) or doubleword (4 bytes). These processors implement an instruction set
with (binary) instruction size as a multiple of their alignment size. 2) Non-aligned, where the
address is not restricted to any size alignment, i.e., it could start at any (permissible) address
granularity, normally at, but not necessarily limited to, any byte address.

One factor that is of interest for performance evaluation of processors or
microcontrollers is the way code is read from the memory system, including cache memory,
read-write memory and read-only memory. Due to the high speed of internal execution and
the resulting increased instruction bandwidth requirements of today's processors and
microcontrollers, memories are commonly accessed through multi-byte buses (or data paths).
The simplest way to handle the high-speed access and avoid unnecessary gate delays is to
access the memory at a fixed alignment for each code or instruction read. This avoids a
complex, post-fetch alignment scheme at most (or all) levels of the system hierarchy. This
approach is universally accepted in high-performance or multi-byte oriented processor
implementations. It works particularly well for the fully-aligned architectures discussed
above.

Another factor of interest in today's microcontrollers is code density. Each
instruction has a quantifiable amount of information and requires a certain size to contain all
the needed information. There are many ways to optimize the encoding of each instruction
for a particular technology/architecture. In general, they follow a very simple rule: the most
frequently used instructions should be as short as possible, cutting down on bandwidth (dynamic
code size) and code memory size (static code size) requirements. The impact of the
encoding is influenced by many factors, but for cost-driven designs this rule holds very
well and tends to increase program code density. Following that rule may dictate variable
size instructions, possibly at non-aligned addresses.

However, for those processors which implement the non-aligned approach, code
fetches after a branch to a non-aligned address usually take extra memory cycles (when the
target instruction length crosses the aligned memory access boundary). This has the major
drawback of slowing down execution and degrading processor performance. There are
several solutions to this problem, most of which implement some type of caching scheme,
which is relatively expensive. The most desirable approach, for the simplicity of the code
fetch and instruction alignment, is the fully-aligned instruction set.
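
By way of illustration only, the extra memory cycles noted above can be estimated with the
following Python sketch, which counts the aligned memory reads needed to obtain an
instruction of a given length starting at a given byte address. The function name
`aligned_reads` and the default two-byte access width are assumptions chosen for the
illustration and do not appear in the specification.

```python
def aligned_reads(address: int, length: int, width: int = 2) -> int:
    """Count the width-aligned reads covering bytes [address, address + length)."""
    if length <= 0 or width <= 0:
        raise ValueError("length and width must be positive")
    first_unit = address // width                 # aligned unit holding the first byte
    last_unit = (address + length - 1) // width   # aligned unit holding the last byte
    return last_unit - first_unit + 1
```

Under these assumptions a two-byte instruction at even address 4 is obtained in one read,
whereas the same instruction at odd address 5 straddles a word boundary and needs two.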

However, for simple cost-driven processor designs, in the fully-aligned
approach the reduced branch penalty will increase performance at the expense of increased
code size.

What is needed is an optimal solution that takes advantage of the code packing
density of the non-aligned architecture and of the fetch speed and simplicity of the fully
aligned architecture.

SUMMARY OF THE INVENTION 
It is an object of the present invention to provide a high code packing density
and a fast fetch following scheduled branches.

It is another object of the present invention to allow instructions to be aligned
on byte boundaries.

It is also an object of the present invention to align flow change target
instructions, such as branch targets, on word boundaries where the size of a word is defined
by the alignment and size of a memory read in the particular implementation.

It is a further object of the present invention to provide a processor that
performs a single or a multibyte instruction or code fetch.

It is an additional object of the present invention to provide a larger branch
target range for the program counter relative to branch instructions for a given offset size.

The above objects can be attained by a processing system comprising
a memory;
a processor comprising
- fetching means for fetching instruction information from the memory, a
word at a time;
- executing means for executing instructions obtained from the instruction
information, at least some of the instructions having a length intermediate
between multiples of a word length;
- instruction placing means for placing the instruction information in memory, the
instruction placing means detecting for each instruction whether that instruction
is a flow change target of a flow change instruction, the instruction placing
means placing each particular instruction so as to maximize code density in the
memory, unless the particular instruction is a flow change target, in which case
the instruction placing means place the particular instruction at a word
boundary.



The processing system contains, for example, a microcontroller architecture and a compiler.
The microcontroller performs unaligned multibyte fetches, jumps to word aligned branch
addresses, allowing a word aligned fetch of the jump-to instruction, and loads code into an
instruction memory with branch instruction target addresses aligned on word boundaries.

An embodiment of the processing system according to the invention comprises
an instruction address modification unit coupled to said program counter register, for
modifying a content of the program counter register under control of a flow change
instruction which contains a reference to a flow change target, a least significant bit of the
reference distinguishing between different words. Thus, the reference does not contain a bit
to distinguish between different bytes within a word. For a given length of the reference
there is therefore a larger branch target range.

These together with other objects and advantages which will be subsequently
apparent, reside in the details of construction and operation as more fully hereinafter
described and claimed, reference being had to the accompanying drawings forming a part
hereof, wherein like numerals refer to like parts throughout.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 depicts the components of one microcontroller or processor according to
the present invention;
Figs. 2 and 3 illustrate memory organization;
Figs. 4 and 5 depict stacks;
Fig. 6 depicts byte alignment in multibyte fetches; and
Fig. 7 depicts program counter modification during fetching.



DESCRIPTION OF THE PREFERRED EMBODIMENTS 

The present invention maximizes the throughput of word-size code for
scheduled changes in program flow, such as occur in branches, jumps and procedure calls.
To do this the processor or microcontroller, instruction set and assembler or compiler of the
present invention align all flow change entry points, such as branch targets, on a word (two
byte) boundary and perform code fetches of multiple bytes, preferably, but not limited to,
two bytes or a word, although multiple word fetches are possible. This allows fast code
fetches after a branch is taken. This architecture maximizes code density by allowing
instructions to be aligned on any byte, fetches instructions in multibyte fetches and performs
post-fetch alignment in the fetch unit. The instruction size varies from one to seven bytes in
length and is not restricted to the alignment and size of the multibyte fetch, also allowing
high code density. Because the instructions can be fetched at byte boundaries but the branch
targets are word boundary aligned, a larger target address range for branch target addresses
is possible in the microcontroller with the same number of address bits because the target
address need not carry the least significant bit that distinguishes between bytes and can
instead carry an additional most significant bit. The microcontroller also supports unplanned
changes in program flow, such as caused by exceptions and interrupts, at any byte address by
storing the full return address on the stack when an interrupt or an exception occurs. The
return instruction pops the full address from the stack, allowing program branch to a byte
granularity address for these unscheduled interrupts and exceptions. Of course, returning
into an odd address will usually be slower than returning into an even address.
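
The gain in branch range described above can be checked with a short calculation. The
following Python sketch, offered as an illustration only, compares the byte range reachable
by a signed displacement when it counts bytes with the range reachable when it counts
two-byte words. The function name `displacement_range` is hypothetical and not part of the
specification.

```python
def displacement_range(bits: int, word_aligned: bool) -> tuple:
    """Return (lowest, highest) byte offset reachable by a signed displacement field."""
    lowest = -(1 << (bits - 1))        # most negative two's-complement value
    highest = (1 << (bits - 1)) - 1    # most positive two's-complement value
    scale = 2 if word_aligned else 1   # a word displacement counts two-byte units
    return lowest * scale, highest * scale
```

For an 8-bit field this gives -128 to +127 bytes when the displacement addresses bytes and
-256 to +254 bytes when it addresses words, i.e. twice the reach for the same field width.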

The architecture of the microcontroller system 10 of the present invention is
illustrated in figure 1. This system 10 includes a single chip microcontroller 12 with
separate internal instruction and data storage. The microcontroller 12 supports external
devices 14 and 16 and, through 20/24 bit external address capability, supports sixteen
megabytes of external instruction storage 18 and sixteen megabytes of external data storage
20. The microcontroller 12 includes a bus interface unit 22 which communicates with the
external memories 18 and 20 over a 16 bit external bi-directional address and data bus 24
where the addresses are transferred in two cycles and a portion of the 20/24 bit address is
latched in an external latch (not shown). The microcontroller 12 communicates with the
external devices 14 and 16 through I/O ports 26-28 which are addressable as special function
registers (SFR) 40. The ports 26-28, as well as other special function registers, are
addressable over an internal peripheral bus 42 through the bus interface unit 22. The on-chip
special function registers 40, some of which are bit addressable, also include a program
status word (PSW) register 44 coupled to an interruption control unit 84 communicating with
internal and external devices. The PSW register 44 is also connected to ALU 72, execution
unit 70 and decode unit 74 for flag and general status control. The registers 40 also include
an interrupt register 44, timer registers 50 and a system configuration register (SCR) 54
containing system configuration bits. The program status word register 44 is addressable
over the peripheral bus 42 for general register operations and is also addressable over a
connection to the internal bus 86 for other execution related operations. The bus interface
unit 22 isolates the peripheral special function registers 40 from the microcontroller core 60.
The core 60 includes a microcoded execution unit 70 which controls execution of instructions
by means of an ALU 72 and the other units. The instructions decoded by a decode unit 74
are fetched from an internal EPROM 76, which is part of the instruction memory space, or
from the external instruction memory 18 by a fetch unit 78. Static RAM 80, which is part
of the data memory space, as well as general purpose registers of a register file 82 are also
available for instruction and data storage.

The microcontroller 12 includes a memory organization as illustrated in figures
2 and 3 where figure 2 illustrates the organization into pages and figure 3 depicts the
organization of a page in more detail. As previously discussed, the microcontroller 12 has
separate address spaces for instruction memory and data memory. The logical separation of
program and data memory allows concurrent access to both memories. The microcontroller
12 supports up to 16 megabytes of separate data and program memory (with 24-bit
addresses). The data memory space 118 is segmented into 64K byte pages 120 as illustrated
in figure 3. There are four banks of byte registers R0 through R7 (see figure 4) which are
also mapped in data memory starting at address 0 in the on-chip RAM (in the register file
82) and going up to address 1F hexadecimal. One of the four banks is selected as the active
bank by two bits in the PSW register 44. The selected bank appears as the general purpose
registers.

Memory in the system 10 is addressed in units of bytes, each byte consisting of
8 bits. A word is a 16-bit value, consisting of two contiguous bytes. The storage order in
the microcontroller 12 is "Little Endian" such that the lower byte of a word is stored at the
lower address and the higher byte is stored at the next higher address. All 16-bit word
addressable locations could be accessed as both bytes and words. The external bus 24 can be
configured in 8 or 16-bit mode, selected during chip reset. Depending on the mode of
operation selected, all 16-bit external accesses could be strictly words (16-bit mode) or bytes
from consecutive memory locations (8-bit mode). An external word fetch in 8-bit mode
results in 2 separate byte accesses (the result is the same in a single word access if the data is
on-chip).
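
The "Little Endian" storage order described above can be illustrated with a minimal Python
sketch. The helper names `store_word` and `load_word` and the use of a `bytearray` as a
stand-in for memory are assumptions made for the illustration only. Each helper performs two
byte accesses, which also mirrors a word access carried out in 8-bit bus mode.

```python
def store_word(memory: bytearray, address: int, value: int) -> None:
    """Store a 16-bit word with its lower byte at the lower address."""
    memory[address] = value & 0xFF              # lower byte at the lower address
    memory[address + 1] = (value >> 8) & 0xFF   # higher byte at the next address


def load_word(memory: bytearray, address: int) -> int:
    """Read back a 16-bit word stored in little-endian order."""
    return memory[address] | (memory[address + 1] << 8)
```

Storing the word 1234 hexadecimal at address 2 places 34 hexadecimal at address 2 and 12
hexadecimal at address 3, and loading the word from address 2 returns the original value.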

As previously stated, the microcontroller 12 supports a program memory 18
with an addressable space of 16 megabytes. The instruction set includes jumps and calls,
some of which operate only on the local code space, some of which can access the entire
program memory space, and some of which are register indirect. As discussed in more
detail later, program memory target addresses referenced by jumps, calls, branches, traps
and interrupts, under microcode program control, are word aligned. However, the return
address from subroutines or interrupt handlers can be on either odd or even byte boundaries.
For instance, a branch instruction may occur at any code address, but it may only branch to
an even address. Branch address alignment provides two benefits: 1) branch ranges are
doubled without providing an extra bit in the instruction, and 2) branched-to code executes
faster if it is word aligned because the first two bytes of the instruction (a word) are fetched
simultaneously. In the microcontroller 12 the stack as illustrated in figures 4 and 5 grows
downward from high to low addresses. The microcontroller 12 architecture supports a LIFO
(last-in first-out) stack. At any given time, the stack pointer (SP) points to the last word
pushed onto the stack. When new data is pushed, the stack pointer is decremented prior to
writing to memory. When data is popped from the stack, the stack pointer is incremented
after the data is read from memory. Since the microcontroller 12 stores data in the memory
most significant bit (MSB) first, the stack pointer always points to the least significant bit
(LSB) of a word written onto the stack. This matches the way a general purpose pointer
accesses data from memory, so that the stack pointer may be copied to a general purpose
pointer register and used to access parameters that reside on the stack. Stack operations are
facilitated by two stack pointers, a user stack pointer (USP) and a system stack pointer (SSP),
located in the registers of register file 82. The 16-bit stack pointers are customary
top-of-stack pointers, addressing the uppermost datum on a push-down stack. The stack pointer
is referenced implicitly by PUSH and POP operations, subroutine calls, returns and
trap/exception interrupt operations. The stack is always WORD aligned. Any PUSH to the stack
(byte/word) decrements the stack pointer by two (SP = SP-2) and any POP (byte/word)
increments the stack pointer by two (SP = SP+2). The stack alignment thus ensures that all
stack operations are on word boundaries (even addresses), eliminating alignment issues and
reducing the interrupt latency time as well as the time for other 16-bit or larger stack
operations. Since SP is pre-decremented prior to a PUSH, a word-aligned stack would grow from
FE downwards. In multitasking systems one stack pointer is used for the supervisory system
and another for the currently active task. This helps in the protection mechanism by
providing isolation of system software from user applications. The two stack pointers also
help to improve the performance of interrupts. The two stack pointers share the same
register address. The stack pointer that will be used at any given time, and that will
"appear" in the register file, is determined by the system mode bit (SM) in the program
status word (PSW) register 44. In the user mode, all pushes, pops, and subroutine return
addresses use the application or user stack. Interrupts, however, always use the system
stack. As previously mentioned, there are eight 16-bit registers in the register file. Of those
eight, one is reserved for the stack pointer (R7) and the other seven may be used as general
purpose pointer registers to access the different segments of the memory. A "byte" register
in the SFR space contains bits that are associated with each of the seven general purpose
pointer registers (i.e. not the SP) and that select either the DS or ES register as the source for
the most significant 8 bits of the 24-bit address for indirect addressing modes. This register is
called the segment select register.
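
The push and pop behaviour described above (pre-decrement by two on every PUSH,
post-increment by two on every POP, for bytes and words alike) can be modelled as follows.
This Python sketch is an illustration only. The class name `WordStack`, the `bytearray`
memory and the initial stack pointer value of 100 hexadecimal are assumptions, and the
placement of the lower byte at the lower address follows the storage order stated earlier in
the description.

```python
class WordStack:
    """Minimal model of a word-aligned, downward-growing stack."""

    def __init__(self, memory: bytearray, sp: int) -> None:
        if sp % 2:
            raise ValueError("stack pointer must be word aligned")
        self.memory = memory
        self.sp = sp

    def push(self, value: int) -> None:
        self.sp -= 2                                    # pre-decrement, always by two
        self.memory[self.sp] = value & 0xFF             # lower byte at the lower address
        self.memory[self.sp + 1] = (value >> 8) & 0xFF  # a byte push leaves this zero

    def pop(self) -> int:
        value = self.memory[self.sp] | (self.memory[self.sp + 1] << 8)
        self.sp += 2                                    # post-increment, always by two
        return value
```

With an initial stack pointer of 100 hexadecimal the first push writes at address FE, so
the stack pointer stays on an even address whether a byte or a word is pushed.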

Exceptions and interrupts are events that pre-empt normal instruction processing
and are unplanned or unexpected/unscheduled changes in program flow. Each interrupt or
exception has an assigned vector that points to an associated handler routine. Exception and
interrupt processing includes all operations required to transfer control to a handler routine,
but does not include execution of the handler routine itself. An exception/interrupt vector
includes the address of a routine that handles an exception. Exception/interrupt vectors are
contained in a data structure called the vector table, which is located in the first 256 bytes of
code memory page 0. All vectors consist of 2 words which are (i) the address of the
exception handler with the procedure entry point located on a word boundary and (ii) the
initial PSW contents for the handler. All exceptions and interrupts other than RESET cause
the current program counter (PC) and PSW values to be stored on the stack and are serviced
after the completion of the current instruction based on their priority level. During an
exception or an interrupt, the entire 24-bit return address and the current PSW word are
pushed onto the stack. The stacked PC (hi-byte): PC (lo-word) value is the 24-bit address of
the next instruction in the current instruction stream. The program counter (PC) is then
loaded with the address of the corresponding handler routine from the vector table and the
PSW is then loaded with a new value stored in the upper word of the corresponding vector.
Execution of the exception or interrupt handler proceeds until the return from interrupt
(RETI) instruction is encountered or until it is pre-empted by another exception or an
interrupt of higher priority. The RETI instruction terminates each handler routine. Under
microcode program control this pops the entire 24 bit return address from the stack into the
PC, reloads the original PSW from the stack and causes the processor to resume execution of
the interrupted routine.
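
The two-word vector format described above can be sketched as a simple table lookup. The
following Python fragment is an illustration only and rests on assumptions that the
description does not state: that the vectors are packed consecutively from address 0, four
bytes each, and that each word is stored lower byte first. The name `read_vector` is
hypothetical.

```python
VECTOR_BYTES = 4    # two 16-bit words per vector
TABLE_BYTES = 256   # the vector table occupies the first 256 bytes of code page 0


def read_vector(code_memory: bytes, number: int) -> tuple:
    """Return (handler_address, initial_psw) for the given vector number."""
    if not 0 <= number < TABLE_BYTES // VECTOR_BYTES:
        raise IndexError("vector number outside the table")
    base = number * VECTOR_BYTES
    handler = int.from_bytes(code_memory[base:base + 2], "little")  # lower word
    psw = int.from_bytes(code_memory[base + 2:base + 4], "little")  # upper word
    if handler % 2:
        raise ValueError("handler entry point must lie on a word boundary")
    return handler, psw
```

The check on the handler address reflects the requirement that the procedure entry point be
located on a word boundary.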

There are several ways in which code or instruction addresses may be formed
to execute instructions on the microcontroller 12. Scheduled or planned changes in the
program flow are done with simple relative branches, long relative branches, 24-bit jumps
and calls, 16-bit jumps and calls, and returns. Simple relative branches use an 8-bit signed
displacement added to the program counter (PC) to generate the new code address. The
calculation is accomplished by shifting the 8-bit relative displacement left by one bit (since it
is a displacement to a word address), sign extending the result to 24 bits, adding it to the
program counter contents, and forcing the least significant bit of the result to zero. The long
relative unconditional branch (JMP) and call with 16-bit relative displacements use the same
sequence. Far jumps and calls include a 24-bit absolute address in the instruction and simply
replace the entire program counter contents with the new value. Return instructions obtain
an address from the stack, which may be either 16 or 24 bits in length, depending on the
type of return and the setting of a page zero mode bit in the SCR register. A 24-bit address
will simply replace the entire program counter value. A 16-bit return address replaces only
the bottom 16 bits of the PC in page zero mode, where the upper 8 bits of the PC are
assumed 0. Code addresses can be generated by using a 16-bit value from a pointer register
appended to either the top 8 bits of the program counter (PC) or the code segment (CS)
register to form a 24-bit code address. The source for the upper 8 address bits is determined
by the setting of the segment selection bit (0 = PC and 1 = CS) in the SSEL register that
corresponds to the pointer register that is used. Note that the CS is an 8-bit SFR.
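
The relative branch calculation and the pointer-based code address formation described above
can be sketched in Python as follows. This is an illustration under stated assumptions, not
the microcode itself: the function names are hypothetical, and `pc` stands for whatever value
the program counter holds when the addition is performed, which the description does not
specify further.

```python
PC_MASK = 0xFFFFFF  # 24-bit program counter


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def relative_target(pc: int, displacement: int, bits: int = 8) -> int:
    """Shift the displacement left by one, add it to the PC, force the LSB to zero."""
    offset = sign_extend(displacement, bits) << 1   # word displacement to byte offset
    return ((pc + offset) & PC_MASK) & ~1           # wrap to 24 bits, clear bit 0


def pointer_code_address(pointer: int, pc: int, cs: int, ssel_bit: int) -> int:
    """Append a 16-bit pointer to the top 8 PC bits (bit = 0) or to CS (bit = 1)."""
    upper = cs if ssel_bit else (pc >> 16)
    return ((upper & 0xFF) << 16) | (pointer & 0xFFFF)
```

For example, with the program counter at 5 and a displacement of 3 the sum is 11, and
forcing the least significant bit to zero yields the even target address 10.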

The fetch operation that allows fetched instructions to be aligned either on a
word or byte boundary is performed by a combination of a conventional non-aligned code
fetch and conventional alignment circuits in the fetch unit 78 and the decode unit 74, as
illustrated in figure 6. A conventional prefetch queue 200 receives words from code or
instruction memory 76 or 18 and presents the words to a conventional alignment multiplexer
202. The multiplexer 202 selects the appropriate byte(s) to present to conventional decode
logic 204 of the decode unit 74. The decode logic 204 decodes the instruction and presents
the decoded instruction to the other units of the core, such as the execution unit 70, through
conventional staging registers 206. In this way instructions can be byte aligned and be
staged in their proper order.
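
The byte selection performed between the prefetch queue and the decode logic can be
modelled in software, purely as an illustration. In the following Python sketch the function
name `select_bytes` and its parameters are hypothetical, and the queued words are assumed to
have been fetched from consecutive even addresses with the lower byte at the lower address.

```python
def select_bytes(words: list, queue_address: int, address: int, length: int) -> bytes:
    """Pick `length` instruction bytes starting at `address` out of the queued words."""
    stream = bytearray()
    for word in words:
        stream.append(word & 0xFF)          # byte at the even address
        stream.append((word >> 8) & 0xFF)   # byte at the following odd address
    offset = address - queue_address
    if offset < 0 or offset + length > len(stream):
        raise ValueError("instruction bytes are not in the queue")
    return bytes(stream[offset:offset + length])
```

A three-byte instruction starting at odd address 1 is thus assembled from the upper byte of
the first queued word and both bytes of the second.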

To facilitate a flow transfer to a word aligned target during an unplanned
program flow change and a return from the unplanned program flow change to a byte aligned
return target, a program counter value in a program counter register 220 is adjusted using a
circuit as illustrated in figure 7. During normal sequential instruction execution operations
where program flow is sequential, after each instruction is executed the most significant bits
(MSB) content of the program counter register 220 is provided directly to an adder 222. The
least significant bit (LSB) is provided to the adder 222 through an AND gate 223 as long as
there is not a branch being indicated by the staging registers 206. The adder 222 adds the
instruction length, provided by the decoder staging registers 206 through a multiplexer 224,
to the program counter value and the updated PC is stored back in the program counter
register 220 through a multiplexer 226. When a scheduled flow change occurs, such as
when a branch instruction, jump, or call is presented by the staging registers 206, the AND
gate 223 prevents the least significant bit from being presented to the adder 222. The
multiplexer 224, instead of presenting the instruction length, presents the most significant bits
(the word address) of the branch offset and a forced "0" in the least significant bit to the
adder 222. The adder 222 adds the LSB augmented offset to the program counter value and
this value is stored in the program counter register 220. When an unexpected change in flow
occurs, such as in an exception or interrupt, then upon a return to the original flow, such as
a return from an interrupt, the multiplexer 226 loads the program counter register 220 with
the full return address from the internal bus 86 obtained, for example, from the stack in the
case of an interrupt.
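
The three program counter update paths described above can be summarised in a small
behavioural model. The Python sketch below is an illustration only and is not a description
of the actual circuit. The function name `next_pc` and its keyword parameters are
hypothetical, and `branch_offset` is assumed to be a signed offset expressed in words.

```python
PC_MASK = 0xFFFFFF  # 24-bit program counter


def next_pc(pc: int, *, length: int = 0, branch_offset=None, return_address=None) -> int:
    """Behavioural model of the program counter update paths."""
    if return_address is not None:
        # Return from an unplanned flow change: the full byte address is loaded.
        return return_address & PC_MASK
    if branch_offset is not None:
        # Scheduled flow change: the PC least significant bit is gated off and the
        # offset is supplied with a forced zero in its least significant bit.
        return ((pc & ~1) + (branch_offset << 1)) & PC_MASK
    # Sequential execution: add the length of the instruction just executed.
    return (pc + length) & PC_MASK
```

Sequential execution and returns can leave the program counter at an odd address, whereas
the branch path always produces an even one.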

As previously mentioned the present invention requires that the transfer target
addresses fall on a word boundary in some cases. A situation where a transfer target is not
word aligned occurs when an instruction aligned on a word boundary has an odd number of
bytes as shown below.

Address   Instruction        Size (Bytes)   Offset (bits)
0000      BNE L1             2
0002      MOV.B (R0+1),R4    3              8
0005      L1:

In this example the branch not equal instruction (BNE) occupies two bytes and the move
instruction (MOV.B) is three bytes, putting the next instruction, the branch target ("L1"), at
an odd byte address. An assembler or compiler can solve this problem and produce
object code with word aligned jump target addresses in a number of different ways. In one
approach in an assembler or compiler, when the symbol table is being built and a jump target
is assigned a symbolic name in the symbol name table, the symbol table can
include a field indicating whether the final address must be aligned on a word boundary.
When the jump target is encountered, if the value of the location pointer or counter is not at a
word boundary, that is, the least significant bit is not binary "0" or the location pointer
contains an odd value, a single NOP instruction can be inserted at the location and the
location pointer incremented to the next byte. This results in the next instruction and the
symbol being word aligned: the value of the location is assigned to the symbolic name of the
jump target and, as a result, the address assigned to the jump target is forced to the next
word aligned location counter address as shown below.

Address   Instruction        Size (Bytes)   Offset (bits)
0000      BNE L1             2
0002      MOV.B (R0+1),R4    3              8
0005      NOP                1
0006      L1:

This solution results in a wasted instruction (NOP) that the execution unit must process.
This wastes several clock cycles but may be acceptable in some cases, such as in the case
where the extra NOP is inserted in loop parameter initialization and the aligned label is a
loop entry which may be branched to many times. Another solution is to have the
compiler scan the contiguous code and identify jump targets that are on byte boundaries and
determine if any instruction prior to the target instruction can be expanded into an instruction
that results in a word alignment for the target. If so, the instruction is expanded; if not, a
NOP can be inserted. The expansion to force a word alignment is shown below.

Address   Instruction          Size (Bytes)   Offset (bits)
0000      BNE L1               2
0002      MOV.B (R0+0001),R4   4              16
0006      L1:

For this example an extra byte of all "0" is added to the relative offset. By using a compiler
or assembler to produce code with jump targets at word boundaries, the penalty for a
planned change in program flow in an unaligned code processor is mitigated and overall
processor throughput increases.
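
The two placement strategies described above (inserting a NOP, or expanding an earlier
instruction) can be sketched as a single layout pass. The Python fragment below is an
illustration only and not the assembler or compiler of the invention. The names `Item` and
`place` are hypothetical, the one-byte NOP size is taken from the listing above, and, as a
simplifying assumption, only the instruction immediately preceding the label is considered
for expansion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    text: str                            # instruction text, or the name of a label
    size: int = 0                        # encoded size in bytes (0 for a label)
    is_label: bool = False
    expanded_size: Optional[int] = None  # size of a longer equivalent encoding, if any


def place(items: list) -> list:
    """Assign addresses so that every label falls on an even (word) address."""
    placed = []    # [address, text, size] entries in program order
    address = 0
    previous = None                      # instruction directly before the current item
    for item in items:
        if item.is_label:
            if address % 2:              # the label would land on an odd byte address
                if previous is not None and previous.expanded_size == previous.size + 1:
                    placed[-1][2] = previous.expanded_size   # expand the instruction
                else:
                    placed.append([address, "NOP", 1])       # otherwise insert a NOP
                address += 1
            placed.append([address, item.text + ":", 0])
            previous = None
        else:
            placed.append([address, item.text, item.size])
            address += item.size
            previous = item
    return [tuple(entry) for entry in placed]
```

Applied to the example listing, the pass inserts a NOP at address 0005 when the move
instruction has no longer encoding, and widens the move instruction to four bytes when it
does. In both cases the label is placed at address 0006. The sketch only adjusts sizes and
addresses and does not rewrite the operand text of an expanded instruction.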

The many features and advantages of the invention are apparent from the
detailed specification and, thus, it is intended by the appended claims to cover all such
features and advantages of the invention which fall within the true spirit and scope of the
invention. Further, since numerous modifications and changes will readily occur to those
skilled in the art, it is not desired to limit the invention to the exact construction and
operation illustrated and described. For example, the present invention can be used when
alignment is desired on a four byte boundary. Accordingly all suitable modifications and
equivalents may be resorted to, falling within the scope of the invention.

CLAIMS:



1. A processing system comprising
a memory;
a processor comprising
- fetching means for fetching instruction information from the memory, a
word at a time;
- executing means for executing instructions obtained from the instruction
information, at least some of the instructions having a length intermediate
between multiples of a word length;
- instruction placing means for placing the instruction information in memory, the
instruction placing means detecting for each instruction whether that instruction
is a flow change target of a flow change instruction, the instruction placing
means placing each particular instruction so as to maximize code density in the
memory, unless the particular instruction is a flow change target, in which case
the instruction placing means place the particular instruction at a word
boundary.

2. A processing system as recited in claim 1, further comprising transfer means
for returning from unplanned program flow changes to an instruction which is not placed at a
word boundary.

3. A processing system as recited in claim 1, wherein said instruction placing
means comprises a compiler performing instruction expansion and code manipulations to
place instructions that are flow change targets on word boundaries.

4. A processing system as recited in claim 1, wherein the executing means
comprise:
an instruction queue receiving byte aligned instructions; and
an alignment multiplexer coupled to said queue and aligning the instructions for
execution.

5. A processing system as recited in claim 1, wherein the executing means
comprise:
a program counter register designating addresses of instructions fetched; and
an instruction address modification unit coupled to said program counter
register, for modifying a content of the program counter register under control of a flow
change instruction which contains a reference to a flow change target, a least significant bit
of the reference distinguishing between different words.

6. A processing system, comprising:
a compiler performing instruction expansion and code manipulations to align
transfer target instructions on word boundaries;
a processor, comprising:
an instruction queue receiving byte aligned instructions;
an alignment multiplexer coupled to said queue and aligning the instructions for
execution;
a program counter register designating addresses of instructions fetched;
an instruction address modification unit coupled to said program counter
register and modifying a program counter using a word address;
transfer means for returning from unplanned program flow changes using byte
aligned addresses; and
execution means for executing the fetched instructions.

7. A method of executing computer instructions, comprising:
a. aligning flow change target instructions on word boundaries;
b. aligning sequentially executable instructions on byte boundaries; and
c. fetching and executing instructions using multiple byte fetching aligned at
word boundaries.

8. A method as recited in claim 7, further comprising:
d. performing unplanned returns using byte aligned program transfer addresses.